1 Learning Expected Points

In this notebook I develop and explore a play-level expected points model for evaluating college football offenses and defenses. The goal of this analysis is to place a value on offensive and defensive plays in terms of their contribution to a team’s expected points.

The data comes from college football games from 2003 to the present. Each observation represents one play in a game, for which we know the team, the situation (down, time remaining), and the location on the field (yards to go, yards to reach the end zone). We also have information about the type of play called in a text field.

1.1 Sequences of Play

For each play in a game, I model the probability of the next scoring event that will occur within the same half. This means the analysis is not at the drive level, but at what I dub the sequence level. Suppose a team has the ball on offense to start the first half. The next scoring event can take on one of seven outcomes:

  • Touchdown (7 points)
  • Field goal (3 points)
  • Safety (2 points)
  • No Score (0 points)
  • Opp safety (-2 points)
  • Opp field goal (-3 points)
  • Opp touchdown (-7 points)

If the team on offense drives down and scores a TD/FG, this ends the sequence. If the team on offense does not score but punts or turns the ball over, the sequence continues with the other team now on offense. The sequence continues until either one team scores or the half comes to an end. In other words, a sequence begins at a kickoff and ends at the next kickoff.

Suppose we have two teams, A and B, playing in a game. Team A receives the opening kickoff, drives for a few plays, and then punts. Team B takes over, which starts drive 2, and they drive for a few plays before also punting. Team A then manages to put together a drive that finally scores.

All plays on these three drives form one sequence. The outcome of this sequence is the points scored by Team A - if they score a touchdown, their points from this sequence are 7 (assuming for now they make the extra point). Team B’s points from this sequence are -7.

When Team A kicks off to Team B to start drive 4, we start our next sequence, which will end either with one team scoring or at the end of the half. We’ll then start over with a new sequence in the second half.
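The steps above can be sketched in code. This is a minimal, illustrative version of tagging sequences - the column names (and a 0/1 SCORING flag marking plays that end in a score) are assumptions, not necessarily the real schema.

```r
library(dplyr)

# a new sequence starts on the play after a scoring play;
# sequences are scoped to a game and a half
plays_with_sequences = plays %>%
        group_by(GAME_ID, HALF) %>%
        arrange(PLAY_ID, .by_group = TRUE) %>%
        mutate(SEQUENCE_ID = cumsum(lag(SCORING, default = 0)) + 1) %>%
        ungroup()
```

Every play then carries a SEQUENCE_ID, and the eventual scoring event of that sequence becomes the outcome for all of its plays.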

Why model the outcome of sequences rather than individual drives? Individual plays have the potential to affect both teams’ chances of scoring, positively or negatively, and we want our model to directly capture this. If an offense turns the ball over at midfield, they are not only hurting their own chances of scoring, they are increasing the other team’s chance of scoring. The value of a play in terms of expected points is a function of how both teams’ probabilities are affected by the outcome.

1.2 Defining Expected Points

A team’s expected points is the sum of the probability of each possible scoring event multiplied by the point value of that event. For this analysis, I assume that touchdowns equate to 7 rather than 6 points, assuming that extra points will be made. I can later bake in the actual probability of making extra points, but this is a simplification for now.

For a given play \(i\) for Team \(A\), we can compute Team A’s expected points using the following:

\[ \text{Expected Points}_A = \Pr(TD) \cdot 7 + \Pr(FG) \cdot 3 + \Pr(Safety) \cdot 2 + \Pr(No\ Score) \cdot 0 + \Pr(Opp.\ Safety) \cdot (-2) + \Pr(Opp.\ FG) \cdot (-3) + \Pr(Opp.\ TD) \cdot (-7) \]
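As a quick sketch, this calculation is just a dot product between the event probabilities and their point values. The point values here match the outcome levels used later in the modeling.

```r
# signed point values for the seven next-score-event outcomes
point_values = c(TD = 7, FG = 3, Safety = 2, No_Score = 0,
                 Opp_Safety = -2, Opp_FG = -3, Opp_TD = -7)

# `probs` is a named vector of predicted probabilities for the seven outcomes
expected_points = function(probs) {
        sum(probs[names(point_values)] * point_values)
}
```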

How do we get the probabilities of each scoring event? We learn them from historical data using a model - I train a multinomial logistic regression on many seasons’ worth of college football plays to learn how situations on the field affect the probability of the next scoring event.

1.3 Next Scoring Event

The outcome for our analysis is the NEXT_SCORE_EVENT. Each play in a given sequence contributes to the eventual outcome of the sequence. Here we can see an example of one game and its drives:

For this game, we can filter to the plays that took place in the lead-up to the first scoring event. In this case, the first sequence included one drive and ended when Texas A&M kicked a field goal.

If we look at another sequence in the second half, there were multiple drives before a team was able to score in that sequence. The next scoring event is always defined from the perspective of the offense.

2 Modeling Expected Points

Our goal is to understand how individual plays contribute to a team’s expected points, or the average points teams should expect to have given their situation (down, time, possession).

For instance, in the first drive of the Texas A&M-Florida game in 2012, Texas A&M received the ball at their own 25 yard line to open the game. The simplest intuition of expected points is to ask: for teams starting at their own 25 yard line at the beginning of a game, how many points do they typically go on to score? To answer this, we look at all opening drives with 75 yards to go, see what the eventual next scoring event was for each of these plays, and take the average of all the points that followed from this situation.
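This back-of-the-envelope version of expected points can be sketched directly, assuming a hypothetical NEXT_SCORE_POINTS column holding the signed point value (+7, +3, ..., -7) of the sequence’s eventual scoring event.

```r
library(dplyr)

# naive expected points: average eventual points for one situation
plays_full %>%
        filter(DOWN == 1, PERIOD == 1, YARDS_TO_GOAL == 75) %>%
        summarise(naive_expected_points = mean(NEXT_SCORE_POINTS, na.rm = TRUE))
```

The model built below is essentially a smoothed, multivariate version of this lookup.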

In this case, this means teams with the ball at their own 25 to start the game generally obtained more points on the ensuing sequence than their opponents, so they have a slightly positive expected points.

But this is also a function of the down. If we compare the expected points for a team in this situation on first down vs on fourth down, we should see a drop in expected points - by the time you hit fourth down, if you haven’t moved from the 25, your expected points drop into the negatives, as you will now be punting the ball back to your opponent and it becomes more probable that they score than you.

The fact that expected points change based on the down and yard line allows us to look at the difference in expected points from play to play - this difference, based on how the situation changed, is the Expected Points Added from a single play.

For any given play, we get a sense of the expected points a team can expect from their situation. For instance, if we look at all total plays in a game, how do expected points vary as a function of a team’s distance from their opponent’s goal line?

This should make sense - if you’re backed up against your own end zone, your opponent has higher expected points because they are, historically, more likely to have the next scoring event, either by gaining good field position after you punt or by getting a safety. We can see this if we just look at the proportion of next scoring events based on the offense’s position on the field.

From this, when we see an offense move the ball up the field on a given play, we will generally see their expected points go up. The difference in expected points before the snap and after the snap is the value added (positively or negatively) by the play.

But, it’s not just position on the field - it’s also about the situation. If we look at how expected points varies by the down, we should see that fourth downs have lower expected points.

We also have other features like the distance needed to convert the first down (filtering here to plays with at most 30 yards to go, as we start to run out of data at higher values and it looks wonky).

And we also have info on time remaining in the half - as we might expect, the proportion of drives leading to no scoring goes up as the amount of time remaining in the half goes down.

We use all of this historical data to learn the expected points from a given situation, then look at the difference in expected points from play to play - this is the intuition behind how we will value individual plays, which we can then roll up to the offense/defense/game/season level.

2.1 Building Models

How do these various features like down, distance, yards to goal, and time remaining affect the probability of the next scoring event? We use a model to learn this relationship from historical plays. I’ll now proceed to building the model which I’ll use for the bulk of the analysis.

I’ll set up training, validation, and test sets based around the season. I’m mostly going to build the model using plays from the 2010 season onwards, as the quality of the play-by-play data gets worse the further back we go, though I’ll do some backtesting of the model on older seasons.

I’m going to use the 2010-2018 seasons as my main training set, building and evaluating the model using a leave-one-season-out approach, akin to k-fold cross validation using seasons as the folds. I’ll use the 2019-2020 seasons as a validation set, and leave 2021 as my test set, which I won’t look at till later on.

# full plays
plays_full = plays_data_score_events %>%
        filter(PLAY_TYPE != 'Kickoff') %>%
        select(GAME_ID,
               DRIVE_ID,
               PLAY_ID,
               SEASON,
               HOME,
               AWAY,
               OFFENSE,
               DEFENSE,
               OFFENSE_SCORE,
               DEFENSE_SCORE,
               SCORING,
               PLAY_TEXT,
               PLAY_TYPE,
               NEXT_SCORE_EVENT_HOME,
               NEXT_SCORE_EVENT_HOME_DIFF,
               NEXT_SCORE_EVENT_OFFENSE,
               NEXT_SCORE_EVENT_OFFENSE_DIFF,
               YARD_LINE,
               HALF,
               PERIOD,
               MINUTES_IN_HALF,
               SECONDS_IN_HALF,
               DOWN,
               DISTANCE,
               YARDS_TO_GOAL) %>%
        filter(DOWN %in% c(1, 2, 3, 4)) %>%
        filter(PERIOD %in% c(1,2,3,4)) %>%
        filter(!is.na(SECONDS_IN_HALF)) %>%
        filter(DISTANCE >=0 & DISTANCE <=100) %>%
        filter(!is.na(NEXT_SCORE_EVENT_OFFENSE)) %>%
        mutate(NEXT_SCORE_EVENT_OFFENSE = factor(NEXT_SCORE_EVENT_OFFENSE,
                                                 levels = c("No_Score",
                                                            "TD",
                                                            "FG",
                                                            "Safety",
                                                            "Opp_Safety",
                                                            "Opp_FG",
                                                            "Opp_TD"))) %>%
        arrange(SEASON, GAME_ID, PLAY_ID)

# training set
plays_train = plays_full %>%
        filter(SEASON >= 2010 & SEASON <2019)

# validation set
plays_valid = plays_full %>%
        filter(SEASON >= 2019 & SEASON <= 2020)

# test
plays_test = plays_full %>%
        filter(SEASON > 2020)

# make an initial split based on previously defined splits
valid_split = make_splits(list(analysis = seq(nrow(plays_train)),
                                 assessment = nrow(plays_train) + seq(nrow(plays_valid))),
                               bind_rows(plays_train,
                                         plays_valid))

# test split
test_split = make_splits(
        list(analysis = seq(nrow(plays_train) + nrow(plays_valid)),
             assessment = nrow(plays_train) + nrow(plays_valid) + seq(nrow(plays_test))),
        bind_rows(plays_train,
                  plays_valid,
                  plays_test))

The outcome is the next scoring event, always defined from the perspective of the offense for any given play.

I currently use the following as features for plays in a baseline model:

  • Quarter
  • Seconds Remaining in Half
  • Down
  • Distance (logged)
  • Yards to opponent’s end zone
  • Down and goal indicator for whether the offense is in an ‘and goal’ situation (distance to convert equals yards to the end zone)

I also include interactions between down and distance, down and yards to end zone, and yards to end zone and seconds remaining. This baseline model doesn’t account for things like offense/defense quality or scoring effects; I’ll train other models later on with these features to get things like ‘Defense Adjusted Expected Points’. But this analysis is focused first and foremost on estimating probabilities.

I’ll now set up a recipe for the baseline model.

baseline_recipe = recipe(NEXT_SCORE_EVENT_OFFENSE ~.,
                         data = plays_train) %>%
        update_role(all_predictors(),
                    new_role = "ID") %>%
        update_role(
                c("GAME_ID",
                  "DRIVE_ID",
                  "PLAY_ID",
                  "SEASON",
                  "HOME",
                  "AWAY",
                  "OFFENSE",
                  "DEFENSE",
                  "SCORING",
                  "OFFENSE_SCORE",
                  "DEFENSE_SCORE",
                  "PLAY_TEXT",
                  "PLAY_TYPE",
                  "NEXT_SCORE_EVENT_HOME",
                  "NEXT_SCORE_EVENT_HOME_DIFF",
                  "NEXT_SCORE_EVENT_OFFENSE_DIFF",
                  "YARD_LINE",
                  "MINUTES_IN_HALF",
                  "HALF"),
                new_role = "ID") %>%
        step_mutate(PERIOD_ID = PERIOD,
                    role = "ID") %>%
        # features we're inheriting
        update_role(
                c("PERIOD", 
                "SECONDS_IN_HALF",
                "DOWN",
                "DISTANCE",
                "YARDS_TO_GOAL"),
                new_role = "predictor") %>%
        # filters for issues
        step_filter(!is.na(NEXT_SCORE_EVENT_OFFENSE)) %>%
        step_filter(YARD_LINE <= 100 & YARD_LINE >=0) %>%
        step_filter(YARDS_TO_GOAL <= 100 & YARDS_TO_GOAL >= 0) %>%
        step_filter(DOWN %in% c(1, 2, 3, 4)) %>%
        step_filter(DISTANCE >=0 & DISTANCE <=100) %>%
        step_filter(SECONDS_IN_HALF <=1800) %>%
        step_filter(!is.na(SECONDS_IN_HALF)) %>%
        step_filter(PERIOD_ID %in% c(1, 2, 3, 4)) %>%
        # create features
        step_mutate(KICKOFF = case_when(grepl("kickoff", tolower(PLAY_TEXT)) | grepl("kickoff", tolower(PLAY_TYPE)) ~ 1,
                                        TRUE ~ 0)) %>%
        step_mutate(TIMEOUT = case_when(grepl("timeout", tolower(PLAY_TEXT)) ~ 1,
                                        TRUE ~ 0)) %>%
        step_filter(TIMEOUT != 1) %>%
        step_filter(KICKOFF != 1) %>%
        step_mutate(DOWN_TO_GOAL = case_when(DISTANCE == YARDS_TO_GOAL ~ 1,
                                             TRUE ~ 0)) %>%
        step_mutate(DOWN = factor(DOWN)) %>%
        step_mutate(PERIOD = factor(PERIOD)) %>%
        step_log(DISTANCE, offset =1) %>%
        step_novel(all_nominal_predictors(),
                   new_level = "new") %>%
        step_dummy(all_nominal_predictors()) %>%
        step_interact(terms = ~ DISTANCE:(starts_with("DOWN_"))) %>%
        step_interact(terms = ~ YARDS_TO_GOAL:(starts_with("DOWN_"))) %>%
        step_interact(terms = ~ YARDS_TO_GOAL:SECONDS_IN_HALF) %>%
        check_missing(all_predictors()) %>%
        step_normalize(all_numeric_predictors())

2.2 Workflows

I’ll define the model I’ll be using here, which is a multinomial logistic regression.

# from glmnet
multinom_mod = multinom_reg(
  mode = "classification",
  engine = "glmnet",
  penalty = 0,
  mixture = NULL
)

I’ll then create a workflow.

# create multinomial workflow
multinom_wf = workflow() %>%
        add_recipe(baseline_recipe) %>%
        add_model(multinom_mod)

# workflow settings
# metrics
class_metrics <- metric_set(yardstick::roc_auc,
                          yardstick::mn_log_loss)

# control for resamples
keep_pred <- control_resamples(save_pred = TRUE, 
                               save_workflow = TRUE,
                               allow_par=T)

I’ll manually define resamples based on the seasons - rather than doing k-fold cross validation, I’ll assign each season to be a fold and train and assess the model leaving one season out at a time.

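The resampling code itself isn’t shown above; one way to build leave-one-season-out resamples is with rsample’s group_vfold_cv, which holds out one group (here, one season) per fold.

```r
library(rsample)

# leave-one-season-out: each fold's assessment set is a single season
manual_resamples = group_vfold_cv(plays_train, group = SEASON)
```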

2.3 Training

Now I’ll train and assess the model on the training set via leave-one-season-out cross validation, then I’ll refit the model to the entire training set.

# fit to resamples
resamples_multinom = multinom_wf %>%
        fit_resamples(resamples = manual_resamples,
                      metrics = class_metrics,
                      control = keep_pred)
## ℹ The workflow being saved contains a recipe, which is 362.9 Mb in
## ℹ memory. If this was not intentional, please set the control setting
## ℹ `save_workflow = FALSE`.
# save locally so as to not need to retrain every time
# write_rds(resamples_multinom,
#      file = here::here("models", "resamples_expected_points.Rds"),
#      compress = "gz")
# fit the model to the whole training set
fit_multinom = multinom_wf %>%
        fit(data = plays_train)

2.4 Resampling Performance

How did the model perform on each resample? I’m looking at the log loss and the area under the receiver operating characteristic curve (roc_auc). The log loss is only meaningful by comparison, so I’ll compare how the model does against a null model on the validation set.

2.5 Inference

Understanding partial effects from a multinomial logit is already difficult, and I’ve thrown a bunch of interactions in there to make this even more difficult.

I’ll look at predicted probabilities using an observed values approach for particular features (using a sample rather than the full dataset to save time). This means taking the model and then altering the feature of interest for every observation and taking the average predicted probability for each outcome across all observations.
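The observed-values approach can be sketched for one feature as follows, assuming `fit_multinom` is the fitted workflow from above (the grid and sample size are illustrative).

```r
library(dplyr)
library(purrr)

# use a sample rather than the full dataset to save time
sample_plays = plays_train %>% slice_sample(n = 10000)

partial_probs = map_dfr(seq(5, 95, by = 10), function(ytg) {
        # set the feature of interest to a fixed value for every observation
        altered = sample_plays %>% mutate(YARDS_TO_GOAL = ytg)
        # average predicted probability for each outcome across observations
        predict(fit_multinom, new_data = altered, type = "prob") %>%
                summarise(across(everything(), mean)) %>%
                mutate(YARDS_TO_GOAL = ytg)
})
```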

How is the probability of the next scoring event influenced by where the offense has possession?

How is this affected by the down?

How does this translate into expected points?

2.6 Validation Set

We can evaluate the model via our leave-one-season-out approach, but we’ll also predict the validation set as an additional check. I’ll compare performance relative to a null model that simply predicts the incidence rate of each outcome in the training set.
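The null-model baseline can be sketched directly: give every validation play the same probability vector (the training-set incidence rates) and score it with multiclass log loss.

```r
library(dplyr)
library(yardstick)

# training-set incidence rate of each outcome
base_rates = plays_train %>%
        count(NEXT_SCORE_EVENT_OFFENSE) %>%
        mutate(prob = n / sum(n))

# attach the same probability vector to every validation play
null_preds = plays_valid %>%
        select(truth = NEXT_SCORE_EVENT_OFFENSE)
for (lvl in levels(plays_full$NEXT_SCORE_EVENT_OFFENSE)) {
        null_preds[[paste0(".pred_", lvl)]] =
                base_rates$prob[base_rates$NEXT_SCORE_EVENT_OFFENSE == lvl]
}

null_preds %>%
        mn_log_loss(truth = truth, starts_with(".pred_"))
```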

What’s the log loss for each outcome?

3 Examining Results

I’ll now start diving into the predictions for individual plays as a means to evaluate plays and teams.

It’s worth noting that we might see some season-level differences that make comparison across seasons difficult, since the predictions are all coming from slightly different models due to resampling.

I’ll get Expected Points Added for all non scoring plays. This part can be a little wobbly, due to data quality issues with defining sequences. The basic thought here is to say, at the start of a play, we know the expected points for a team in that situation, EP_Pre. We then look to the next play to see the expected points for the team after the result of the previous play, EP_Post. EP_Added is the difference between these two outcomes from the perspective of the offense.

This means that if the ball is turned over without a score, the team on offense becomes the defense and the sign of the expected points on the next play flips for their calculation. For scoring plays, EP_Added is empty, but I create another feature simply called Points_Added, in which I take the difference between EP_Pre and the points scored on the play: 7 for touchdowns, 3 for FGs, 2 for safeties.
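The EP_Added logic described above can be sketched with a lead within each sequence, flipping the sign when possession changes. EP_PRE and SEQUENCE_ID are illustrative column names, assuming EP_PRE holds the model’s expected points at the snap from the offense’s perspective.

```r
library(dplyr)

plays_scored = plays_scored %>%
        group_by(GAME_ID, SEQUENCE_ID) %>%
        arrange(PLAY_ID, .by_group = TRUE) %>%
        mutate(NEXT_OFFENSE = lead(OFFENSE),
               EP_POST = case_when(
                       is.na(NEXT_OFFENSE) ~ NA_real_,        # last play of sequence
                       NEXT_OFFENSE == OFFENSE ~ lead(EP_PRE), # same team still on offense
                       TRUE ~ -lead(EP_PRE)                    # possession flipped: flip sign
               ),
               EP_ADDED = EP_POST - EP_PRE) %>%
        ungroup()
```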

3.1 Game Results

I’ll look at a few games, play by play, to get a sense of how this is looking. I’ll pick one game completely at random, in no way influenced by my fandom: Texas A&M vs Alabama in 2012.

What were the most impactful plays in this game in terms of expected points added? I’ll look at the 15 plays in this game with the largest absolute change.

Kind of interesting - this game is remembered for a lot of plays by Johnny Manziel, but the most impactful plays in the game in terms of expected points changes were actually turnovers forced by the A&M defense.

However, this doesn’t include scoring plays. The Points_Added measure looks at the difference between actual points scored vs expected points from the situation.

In this case, we can still use the expected points of the situation to give us a measure of how many points the play added. In the case of Christine Michael’s 1 yard TD run, the points added are pretty low, as you’d expect teams to score from that situation. By comparison, AJ McCarron’s 54 yard pass on 3rd and 6 in the 4th quarter had a much higher points added, as Alabama was in a third down situation at midfield and made a huge play. For this reason, what I’m calling Points Added has been dubbed by others as a measure of a team’s offensive explosiveness. For most of my analyses I’m condensing Expected Points Added and Points Added into one metric that I’m calling Predicted Points Added.

What’s cool is that we can do this (right now) for any play from any game from 2010-2020. For instance, we can look at a team and ask, what were its highest predicted points added plays in each season? This is Wisconsin’s top plays for each year.

What were the highest predicted points added plays (including scoring plays) from each season?

Looks like there’s a data quality issue with the Iowa vs Northern Iowa game - the yards to go look to be incorrect there, which makes that play look like a big yardage gain. We also notice some data quality issues if we look at the highest expected points added plays, not including scoring plays.

For that 2011 play, it’s 4th and 58 with 93 yards to go and the pass is complete for a loss of seven yards, and yet the model considers this a play worth 9 expected points? The heck? It turns out that this, and most of these, are actually just data quality issues with ESPN’s play-by-play data.

In this case, the play-by-play data has a bizarre sequence in which Troy is listed as punting, then completing a pass for a loss on 4th and 58, while still having possession on the next play with a 1st and 10. Since the expected points added calculation is based on the difference between plays, that phantom 4th and 58 play throws off the value of the next play. This is unfortunately quite common in ESPN’s play-by-play data, especially as we go further back.

3.2 Team Offenses by Season

Having scored all individual plays, we can now roll this up to whatever level of analysis we’re interested in. We can, for instance, look at a team’s offense for each season over this time period. Continuing to select a team at random, we’ll look at Texas A&M’s offense in terms of predicted/expected points game by game over this time period. The line in navy, EPA_Average, looks at a team’s expected points added per play without including scoring plays. The line in light blue, PPA_Average, looks at all plays, including scoring.

If we use these as general measures of the efficiency of an offense in a season/game, where does A&M place compared to all teams? What is good? I’ll plot the distribution of all teams with at least 400 plays in a season, then overlay where A&M ranks.

Based on either metric, we can see that A&M’s offenses in 2012 and 2013 were among the best in college football at the time. It also looks like A&M’s 2018 and 2020 offenses were pretty darn good as well.

Let’s look at another team that we would expect to be really strong this entire time period: Ohio State.

What about a team that’s been up and down during this time period? Some lowly team, like Texas.

What about Florida State? We would expect them to be towards the top in the Jimbo/Jameis era and then fall back to Earth during the Taggart era.

Hmm. That’s kind of interesting - I’ve heard people discuss Jimbo’s offense as lacking in explosiveness, which some argue is the difference between the expected points and the predicted points. This definitely wasn’t the case in 2013, but generally FSU had a pretty efficient and explosive offense until their collapse in 2017.

3.3 Team Offense and Defense by Game

But this is only looking at one side of the ball at the season level. We can look at a team’s defense in the same way, in terms of the average expected/predicted points allowed to the offenses they face. In this case, you want your defense to have a negative value - this indicates that offenses weren’t able to generate points against your defense. I’ll look at a team’s offense and defense side by side in terms of Predicted Points Added - meaning I am including scoring plays. We would expect a good team to be one whose offense (in blue) generates more points per play than they allow on defense (in red).

I’ll look at a team like Wisconsin. They seem to generally have a pretty good defense, but struggle in terms of their ability to generate points on offense.

For a dominant team like Alabama, we should see their offensive efficiency be high and the points they allow on defense be low - the blue should almost always outstrip the red.

Evidently they were so good at stopping Tennessee in 2011 that it breaks my axis.

For a mediocre team, like Kansas, we should see the opposite type of graph here.

3.4 Ranking Teams by Season

We can then aggregate a team’s performance over a full year to rate their overall offensive/defensive efficiency. We can also break this down by passing vs rushing plays on both sides of the ball.

For the purpose of evaluating a team’s defense in this analysis, I’ll flip the sign of a defense’s points, which are currently scored from the perspective of the offense, so that positive is always good for a team.

Putting this all together, I can rank a team’s offense/defense by year and then sort to see where the top teams of all time tend to rank. Here are the top 50 teams using a composite score of both, including only offenses and defenses with at least 400 plays in a season.
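The roll-up can be sketched as a pair of grouped summaries, assuming a PPA column for each play’s predicted points added (column names are illustrative).

```r
library(dplyr)

# offense: average predicted points added per play, by team-season
offense_ratings = plays_scored %>%
        group_by(SEASON, TEAM = OFFENSE) %>%
        summarise(OFF_PPA = mean(PPA, na.rm = TRUE),
                  OFF_PLAYS = n(), .groups = "drop") %>%
        filter(OFF_PLAYS >= 400)

# defense: flip the sign so positive is always good for the team
defense_ratings = plays_scored %>%
        group_by(SEASON, TEAM = DEFENSE) %>%
        summarise(DEF_PPA = -mean(PPA, na.rm = TRUE),
                  DEF_PLAYS = n(), .groups = "drop") %>%
        filter(DEF_PLAYS >= 400)

# composite score of both sides of the ball
team_ratings = offense_ratings %>%
        inner_join(defense_ratings, by = c("SEASON", "TEAM")) %>%
        mutate(COMPOSITE = OFF_PPA + DEF_PPA) %>%
        arrange(desc(COMPOSITE))
```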

And here are the 25 worst teams using the same criterion.

We can visualize all of this by placing every team based on its overall offensive/defensive efficiency.

We can also break this down by conference year over year. For instance, where do we place SEC teams in each of these seasons?

What about the Big Ten?

3.5 Ranking Teams by Pass/Run Efficiency

We can break this down further to examine a team’s expected points based on pass/run.

Filtering to teams with at least 200 such plays in a season, which teams had the most efficient passing offense in terms of predicted points added?

What about run offense?

Flipping this around, which defenses yielded the most points to passing/run?

What about the best defenses? Rutgers in 2012 with the best run defense? Really? Keep in mind that these ratings are not adjusted for opponent strength, though evidently Rutgers did have a bunch of defenders drafted in 2013, so..?

Putting this all together we can break down teams overall as before, only now explicitly looking at teams based on their pass and run offense/defense efficiency. Florida State 2013 still jumps to the top, evidently their passing defense that year was fantastic in addition to their passing offense. The picture is mostly unchanged from looking at just the overall numbers, but breaking it down can reveal which teams were more balanced while others were pass/run heavy.

We can also look at individual teams year over year to see how their offense, defense, and overall efficiency have trended.

For instance, Oregon was great in the early 2010s then had a few years in which they were down.

Florida State had that amazing year, but has really fallen back after Jimbo’s bad year and eventual departure.

Wisconsin has generally been pretty good but not quite elite during this time period, people usually hound them for their passing game lately, how do they compare year over year?

What about a historically weaker team that has been on the rise lately? Iowa State?

And of course the team I keep randomly selecting, Texas A&M.